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Abstract 


A hierarchical gamma process infinite edge 
partition model is proposed to factorize 
the binary adjacency matrix of an un¬ 
weighted undirected relational network un¬ 
der a Bernoulli-Poisson link. The model de¬ 
scribes both homophily and stochastic equiv¬ 
alence, and is scalable to big sparse networks 
by focusing its computation on pairs of linked 
nodes. It can not only discover overlap¬ 
ping communities and inter-community inter¬ 
actions, but also predict missing edges. A 
simplified version omitting inter-community 
interactions is also provided and we reveal 
its interesting connections to existing models. 
The number of communities is automatically 
inferred in a nonparametric Bayesian man¬ 
ner, and efficient inference via Gibbs sam¬ 
pling is derived using novel data augmenta¬ 
tion techniques. Experimental results on four 
real networks demonstrate the models’ scal¬ 
ability and state-of-the-art performance. 


1 INTRODUCTION 


Community detection and link prediction are two im¬ 
portant problems in network analysis. A vast num¬ 
ber of community detection algorithms based on vari¬ 
ous useful heuristics, such as modularity maximization 
(Newman and Girvan 2004) and clique percolation 


(Palla et al. 2005), have been proposed. See Fortunate 


(2010) for a comprehensive review. These algorithms, 
however, are not based on generative models and hence 
usually cannot be used to generate networks and pre¬ 
dict missing edges (links). Moreover, how to set the 
number of communities is a critical issue that has not 
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been well addressed by them. In this paper, we will fit 
unweighted undirected relational networks using non¬ 
parametric Bayesian generative models, which can be 
used to simulate random networks, detect latent over¬ 
lapping communities and community-community in¬ 
teractions, and predict missing edges, with the number 
of communities automatically inferred from the data. 


For a relational network, a community can be consid¬ 
ered as a subset of nodes (vertices) that are densely 
connected to each other but sparsely to the others, 
such as those in a social network, or it can be con¬ 
sidered as a subset of nodes that are sparsely con¬ 
nected to each other but densely connected to the 
nodes belonging to another community, such as those 
in a network consisting of carnivores and herbivores: 
tigers and bears both hunt deers but rarely prey on 
each other. The former phenomenon is usually de¬ 
scribed as assortativity or homophily, while the lat¬ 
ter is known as dissortativity or stochastic equivalence 


(Hoff 2008). As a relational network may exhibit both 


homophily and stochastic equivalence, an algorithm 
capable of modeling both phenomena would usually 
be preferred if no prior information on assortativity is 
available. If analyzing assortative networks with dense 
intra-community connections is the main goal, then 
one may consider an assortative algorithm that models 
homophily but not necessarily stochastic equivalence. 

The stochastic blockmodel (SBM) is a popular la¬ 
tent class model to detect latent communities (Hol¬ 


land et al. 1983 Nowicki and Snijders, 2001). It par¬ 


titions the nodes into disjoint communities, and mod¬ 
els the probability for an edge to exist between two 
nodes solely based on which two communities that 
they belong to. It is simple and scalable, and mod¬ 
els both homophily and stochastic equivalence. In ad¬ 
dition, the infinite relational model, a nonparametric 
Bayesian extension of the SBM based on the Ghinese 


restaurant process (Aldous 1985), allows the number 


of communities to be automatically inferred from the 


data (Kemp et al. 2006). Despite these attractive 


properties, the SBM is restrictive in that communities 
are not allowed to overlap. In practice, however, it is 
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common for a node to belong to multiple communities, 
motivating the development of more advanced latent 
class models, such as the mixed-membership stochas¬ 
tic blockmodel (M MSB) of|Airoldi et ah (2008) and its 
various extensions ( [Gopalan et ah 2012 Kim et ah 


2013). The MMSB generalizes the SBM to allow a 


node to participate in multiple communities, yet since 
it has to infer two community indicators for each pair 
of nodes, regardless of whether an edge exists in that 
pair, its computation grows quadratically as a function 
of the number of nodes N. Moreover, the number of 
communities in the MMSB is a model parameter that 
needs to be carefully selected. 


In this paper, instead of clustering nodes, as in the 
SBM, or clustering all possible edges, as in the MMSB, 
we propose the edge partition model (EPM) to parti¬ 
tion only the observed edges, which readily leads to the 
partition of nodes: if the edges linked to a node are 
partitioned into multiple communities, then the node 
is naturally affiliated with all these communities, and 
could be hard assigned to a single community that has 
the strongest presence in its edges. In contrast to the 
SBM, the EPM allows communities to overlap; and 
in contrast to the MMSB that spends 0{N‘^) compu¬ 
tation clustering all possible edges, the EPM spends 
0{dN) computation partitioning only observed edges, 
where d is the average degree (number of edges) per 
node, leading to notable computational savings as d 
is often much smaller than A/" in a big sparse network 
commonly observed in practice. 


To support a potentially infinite number of commu¬ 
nities and to model both homophily and stochastic 
equivalence in an unweighted undirected relational 
network, we propose a hierarchical gamma process 
(HGP) EPM, which links each observed edge to a la¬ 
tent count using a Bernoulli-Poisson link, and then fac¬ 
torizes the latent N x N random count matrix. The 
HGP supports the EPM to have an infinite dimen¬ 
sional feature vector for each node to describe its affil¬ 
iations with communities, and an infinite dimensional 
square rate matrix, whose diagonal and off-diagonal el¬ 
ements describe the intra- and inter- community inter¬ 
actions, respectively. We also propose a gamma pro¬ 
cess EPM as a simplified version of the HGP-EPM, 
which omits inter-community interactions to gain sim¬ 
pler inference and faster computation at the expense 
of reduced ability to model stochastic equivalence. 


Gonceptually, our idea of directly partitioning edges 
and implicitly partitioning nodes into communities is 


related to the one in Ahn et al. (2010) and Evans and 


Lambiotte (2009). In terms of construction, our EPMs 


are related to the Poisson factor models of IBall et al.l 
(2011) and the Eigenmodel of Hoff (2008). In terms of 
supporting an infinite number of features, our EPMs 


are related to the models in Miller et al. (2009) and 


Morup et al. (2011) that use the Indian buffet pro¬ 


cess of Griffiths and Ghahramani (2005) to support an 
infinite binary feature matrix. The proposed models 
depart from existing ones with several distinctions: 1) 
a Bernoulli-Poisson link connects each edge to a la¬ 
tent count that is further partitioned; 2) a hierarchical 
gamma process is constructed to support an infinite 
number of communities and an infinite-dimensional 
square matrix to describe community-community in¬ 
teractions; 3) two nonparametric Bayesian EPMs are 
constructed to factorize the N x N binary adjacency 
matrix under the Bernoulli-Poisson link, supporting a 
nonnegative feature matrix with an unbounded num¬ 
ber of columns, and at the same time assign each edge 
and hence each node to one or multiple latent commu¬ 
nities; and 4) efficient and scalable Bayesian inference 
via Gibbs sampling is provided. 


2 FACTOR ANALYSIS AND 
BERNOULLI-POISSON LINK 

Our basic idea is to factorize the BINARY network ad¬ 
jacency matrix using tools developed for COUNT data 
analysis, and to discover overlapping communities and 
their interactions by examining how the latent count 
for each edge is partitioned. This section will primarily 
discuss individual model components and their proper¬ 
ties, with hierarchical Bayesian models presented later. 


2.1 Poisson Factor Analysis 

We propose a Poisson factor model for a weighted 
undirected N x N relational network as 

rriij ^ Po = i X]/C2 = l 4^iki^kik24^jk2^ 5 (1) 

where rriij = "nrji is the integer-valued weight 
(observed or latent) that links nodes i and j, 
(0^1 ,..., (/)ix) is the positive feature vector for node 
i, Xk^k 2 = ^k 2 ki is a positive rate, and the symbol 
= denotes “equal by definition.” This model is con¬ 
ceptually simple: with (pik^ measuring how strongly 
node i is affiliated with community ki and Xkik 2 mea¬ 
suring how strongly communities ki and k 2 inter¬ 
act with each other, the product 0i/ei A/c^/e2 0j/c2 mea¬ 
sures how strongly nodes i and j are connected due 
to their affiliations with communities ki and /c2, re¬ 
spectively, and a weighted combination of all intra¬ 
community weights {Xkk}i<k<K and inter-community 
ones {Xk^k 2 }i<kii>^k 2 <K is the expected value of rriij. 

The factor model in § makes intuitive sense. Eor 
example, suppose persons i and j are both residents of 
City Avatar and active members of the Avatar anglers 
Meetup group that organizes fishing trips regularly. In 
addition, persons i is an active member of the Avatar 
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artificial intelligence (AI) Meetup group while person 
j is an active member of the Avatar statistics Meetup 
group. Denoting rriij as the number of times that i and 
j attend the same group meeting in 2015, then due to 
their strong affiliations with the anglers Meetup group, 
rriij would have a large expected value, which is likely 
to be further increased if the AI and statistics Meetup 
groups hold joint events regularly. 

To model an assortative relational network exhibiting 
homophily but not necessarily stochastic equivalence, 
we may omit the inter-community interactions by let¬ 
ting \k^k 2 = 0 7^ ^2 and simplify Q as 

rriij ~ Po (Ef=i rk4>ik4>jkj , (2) 


where = Xkk indicates the prevalence of commu¬ 
nity k, and two nodes with similar latent features are 
encouraged to be linked by an edge with a large weight. 


We note Ball et al. (2011) had examined a model re¬ 
lated to ([^ and briefly mentioned a model related 
to 0 . However, they used a heuristic approach to 
model binary data under the Poisson distribution, did 
not provide a principled way to set the number of 
communities K, and had to create possibly nonexis¬ 
tent self-edges in order to derive tractable expectation- 
maximization (EM) inference. This paper will address 
all these issues rigorously, in a nonparametric Bayesian 
manner, and carefully examine the models in both ([^ 
and 0 and provide efficient Bayesian inference. 


2.2 Bernoulli-Poisson Link 

To use the Poisson factor models in ^ and ([^ for an 
unweighted network with a binary adjacency matrix, 
we introduce a Bernoulli-Poisson (BerPo) link function 
that thresholds a random count at one to obtain a 
random variable in {0,1} as 


b = l{m >1), m rsj Po(A), (3) 


where 5 = 1 if m > 1 and 6 = 0 if m = 0. The intu¬ 
ition is that two nodes are connected if they interact 
at least once. The mathematical motivation is after 
transforming a binary-modeling problem into a count- 
modeling one, one is readily equipped with a rich set 
of statistical tools developed for count data analysis 
using the Poisson and negative binomial distributions. 

If m is marginalized out from ([^, then given A, one 
obtains a Bernoulli random variable as 


b ^ Ber (l — e ^) . 

The conditional posterior of m can be expressed as 


(m|5,A)-5-Po+(A), 


where x ^ Po+(A) follows a truncated Poisson dis¬ 
tribution, with P{x = k) = {1 — /k\ for 

/c G {1, 2,...}. Thus if 5 = 0, then m = 0 almost surely 
(a.s.), and if 6 = 1, then m ^ Po+(A), which can be 
simulated with rejection sampling: if A > 1, we draw 
m ^ Po(A) till m > 1; and if A < 1, we draw both 
n ^ Po(A) and u ^ Unif(0,1) till u < l/(n + 1), and 
then let m = n + 1. The acceptance rate is 1 — e~^ 
if A > 1 and A“^(l — e~^) if A < 1, and reaches its 
minimum, 63.2%, when A = 1. 

The BerPo link shares some similarities with the pro¬ 
bit link that thresholds a normal random variable at 
zero, and the logit link that lets b ^ Ber[e^/(1 + e^)]. 
We advocate the BerPo link as an alternative to the 
probit and logit links since if 5 = 0, then m = 0 a.s., 
which could lead to significant computational savings 
if a considerable proportion of the data are equal to 
zero. In addition, the additive property of the Poisson 
allows us to model the link strength between any two 
nodes by aggregating the contributions of all possible 
intra- and inter- community interactions, and the con- 
jugacy between the Poisson and gamma distributions 
makes it convenient to construct hierarchical Bayesian 
models amenable to posterior simulation. 

2.3 Overlapping Community Structures 

Note that 0 can be augmented as 

rnikik2j 1 'knikik23 ~ Po Afc^fc2 0jA:2) • (4) 

fci k2 

where mik^k 2 j represents how often nodes i and j in¬ 
teract due to their affiliations with communities ki and 
/c2, respectively. We may consider that the model is 
partitioning the count rriij into {mik^k2j}i<kuk2<K, 
and hence we call the Poisson factor model in ^ to¬ 
gether with the BerPo link in ^ as an edge partition 
model (EPM), in which each edge is partitioned ac¬ 
cording to all possible community-community in¬ 

teractions, and how strongly node i is affiliated with 
community k can be measured with (pik^ik^ where 

^ik •= S/c' ^jk'^kk' (5) 

represents how strongly node i would interact with all 
the other nodes through its affiliation with commu¬ 
nity k. We further introduce the latent count 

•= X]/c2 '^ikk2j + X]/ci "^jkiki^ (6) 

to represent how often node i is connected to the other 
nodes due to its affiliation with community k. We 
can then assign node i to multiple communities in 
{k : nriik.. > 1}, or (hard) assign it to a single com¬ 
munity using either argmax(0^/eCo’^/c) or argmax(m^/c..). 

k k 

Similar analysis applies to a simpler EPM built on 0 . 
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By hard assigning each node to a single community 
and ordering the nodes from the same community to 
be adjacent to each other, we expect the ordered ad¬ 
jacency matrix to exhibit a block structure, where the 
blocks along and off the diagonal represent the intra- 
and inter- community connections, respectively. 


2.4 Scalability for Big Sparse Networks 


We are motivated to construct the EPMs because they 
not only allow each edge and hence each node to par¬ 
ticipate in multiple communities, but also readily scale 
to a big sparse network whose average degree per node 
is much smaller than N. A key observation for scal¬ 
able computation is that 0 can be augmented as 
Q, where mik^k 2 j = 0 for any ki and k 2 if no 
edge exists between nodes i and j (i.e., rriij = 0 ). 
On a sparse network, where the edges constitute only 
a small portion of all possible N‘^ edges, this prop¬ 
erty makes the EPMs computationally appealing. By 
contrast, conceptually related models, including the 


(2008k and latent feature relational model of Miller 


MMSB of Airoldi et al. (2008), Eigenmodel of Hoff 


et al.| (2009), spend computation indiscriminately on 


all pairs of nodes (i,j) no matter whether an edge ex¬ 
ists between nodes i and j, and hence they have 0{N‘^) 
computation and do not scale well as N increases. 


set A C ft ( [Eerguson 1973[ Kingman 1993). The 
Levy measure of the gamma process can be expressed 
as u{drd(f)) = r~^e~^^^drGo{d(l)), and a draw from the 
gamma process, consisting of countably infinite atoms, 

can be expressed as G = where 0;. 

Go = 70^0 5 9o{d<P) = Go{dcj))/jo is the base distri¬ 
bution, and 7 o = Go{ft) is the mass parameter. A 
gamma process based model has an inherent shrinkage 
mechanism, as in the prior the number of atoms with 
r/c greater than 5 G follows Po( 7 o r~^e~^^dr)^ 
whose Poisson rate decreases as e increases. 

Given G = further define a relational 

gamma process (rEP) as 

A|G^rrP(G,e,l//3), (8) 

a draw from which is defined as 
^ = 5^/ei=l S/C2 = l 

where ^ and p are both in M+, \k 2 k 1 = ^kik 2 ^ 


^kik2 


( Gam(^r/,,, 1//3) , if /c 2 = /ci, 
[Gam(r/e,r/c 2 ,1//5), if k2 > ki. 


Given a relational gamma process draw A, we generate 
a binary adjacency matrix B G {0,1}^^^ as 


3 EDGE PARTITION MODELS 

3.1 Hierarchical Gamma Process 


B|A ^ Ber 


1 - 


00 00 

n n exp 


ki=l k2 = l 


• (9) 


The EPM takes a weighted combination of all pos¬ 
sible intra- and inter- community weights to explain 
each pair of node, however, the number of commu¬ 
nities K is still a model parameter that needs to be 
set appropriately. To allow K to be inferred from 
the data and potentially grow to infinity, we need 
to introduce a stochastic process that can generate a 
countably infinite number of atoms {(t>k} 1 , 00 ^ where 
0/c = ( 01 /C 5 • • • 5 0Ar/c)^ measures how strongly the N 
nodes are affiliated with community k, and an infinite 
dimensional square matrix {Xkik 2 }i<ki,k 2 <oo^ where 
Xk 2 ki = A/C 1 /C 2 measures how strongly communities ki 
and k 2 interact with each other. Moreover, we need 
to ensure ^kik 2 be finite a.s. and we 

may wish to impose some structural regularization on 
the infinite square matrix. 

To satisfy all these needs, we first define 

G^rP(Go,l/co) (7) 

as a gamma process on a product space x ft, 
where = {x : x > 0}, ft is a complete sepa¬ 
rable metric space, I/cq is a positive scale parame¬ 
ter, and Go is a finite and continuous base measure, 
such that G{A) ^ Gam(Go(A), I/cq) for each Borel 


Equations 0 , 0 and 0 constitute an HGP-EPM 
that supports countably infinite atoms and a count¬ 
ably infinite square matrix, the total sum of whose 
elements has a finite expectation, as shown in the fol¬ 
lowing Lemma, with proof provided in the Appendix. 

Lemma 1. The expectation of 
finite and can be expressed as 


E 


fe2 


■- k 


= —770 

Cofi 


1 




The usual scenario to consider an HGP construction 
is when one models grouped data and wishes to share 
statistical strengths across groups. Eor example, the 
gamma-negative binomial process of |Zhou and Garm] 
the hierarchical Dirichlet process of 
, is considered for topic modeling, 
where each document is associated with a gamma pro¬ 
cess, and these gamma processes are coupled by shar¬ 
ing a lower-level {i.e., further from the data) gamma 
process as their atomic base measure. The proposed 
HGP is distinct in that the product of the weights 
of any two atoms of the lower-level gamma process is 
used to parameterize the shape parameter of a gamma 
random variable higher in the hierarchy. 


( 2012 ), related to 


Teh et al. (2006 
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The proposed HGP also helps express our prior be¬ 
lief that an atom with a small weight tends to repre¬ 
sent a small community, which also tends to interact 
with the others less frequently. Note that if we let 
A/c/c ^ Gam(r^, 1//3), then the expectation of the ma¬ 
trix {A/ci/e2}i</ci,/e2<oo given {r/c}i,oo has a rank of one. 
We use instead of as the shape parameter of \kk 
to allow Tk to be inferred with Gibbs sampling and to 
prevent overly shrinking X^k for small communities. 
Note that Palla et al. (2014) proposed a reversible in¬ 


finite hidden Markov model using a related HGP in¬ 
finite square rate matrix, the normalization of whose 
each row represents a state transition probability vec¬ 
tor. Our HGP serves a distinct modeling purpose; no 
normalization is required for the infinite square rate 
matrix, and our model allows exploiting unique data 
augmentation techniques to infer both \k^k 2 Vk 
with closed-form Gibbs sampling update equations, as 
discussed in Section |3.4| and the Appendix. 


3.2 Hierarchical Gamma Process EPM 

We choose the base distribution of the gamma pro¬ 
cess G ^ rP(Go, 1/co) as go{(f)) = Gam(ai, 1 /q). 
For implementation convenience, we consider a dis¬ 
crete base measure as Go = Ylk=i where K is 

a truncation level that is set large enough to ensure a 
good approximation to the truly infinite model. We 
express the (truncated) HGP-EPM as 

K K 

hij = XilTlij ^ 1), TTlij = ^ ^ ^ ^ '^i/ci/c2i: 

/ci = l /C2 = l 

'^ikik2j ^ Po (0i/ciA/ci/c2 0i/c2) 5 

^ Gam(ai, 1 /q), ^ Gam(eo, l//o), 

rGam(^r/ci,l/^), if k 2 = h, 

A/ci/C2 ^ \ 

[Gam(r/e^r/e2,1//^), h fe > 

Vk ^ Gam( 7 o/i^, I/cq), ( 10 ) 

where Xk 2 ki = ^kik 2 conjugate gamma priors are 
imposed on 70 , cq, q and Note that marginalizing 


out both rriij and mik^k 2 j from ( 10 ) leads to 


bij 


Her 


K K 


n n exp( 0i/ei 0 jf/e 2 ) 


ki = l /C2 = l 


. ( 11 ) 


A noticeable advantage of the augmented representa¬ 


tion in ( 10 ) over ( 11 ) is that ( 10 ) is amenable to pos¬ 


terior simulation, as discussed in Section [3^ 


Note that similar to Hoff (2008) and Lloyd et al. 


( 2012 ), we assume that the nodes are exchangeable 


and hence the discussions of Hoover (1982) and Aldous 


3.3 Gamma Process EPM 

If we omit inter-community interactions by letting 
Xk^k 2 = 0 for ki 7 ^ k 2 and Xkk = '^/c, then the HGP- 
EPM reduces to a gamma process EPM (GP-EPM), 
which is likely to well fit assortative networks but not 
necessarily disassortative ones. We notice an inter¬ 
esting connection to the community-affiliation graph 


model (AGM) of Yang and Leskovec (2012 2014): the 
GP-EPM generates an edge with probability 

P{bij = l) = 1- rifc {1 - [l-exp(-rfe(/)ife(/)jfc)]}; (12) 

if we define pk = I — and further impose the 
restriction that (pik G { 0 , 1 }, then (p!^ reduces to 


= 1 ) = 1 - n 


keCi 


(1 -Pk) ■ 


(13) 


where Cij = {k : (pik = 1 and (pjk = 1} C {1,..., K} 
is a set of communities that nodes i and j share; note 


that (13) is almost the same as the AGM of Yang and 
Leskovec (2012 2014| ). In fact, one may consider the 


GP-EPM with bij ^ Ber[l - e~^ H/c exp{-rk(t)ik(t)jk)], 
where e G and (f)ik G {0,1}, as a nonparametric 
Bayesian AGM. Similarly, we also notice that of 
the HGP-EPM is related to the model of |Morup et al. 
(2011) if we restrict <pik G {0,1}. 

Yang and Leskovec ( 2012[ |2014 ) argue that all pre¬ 
vious community detection methods, including clique 
percolation and MMSB, would fail to detect communi¬ 
ties with dense overlaps, because they all had a hidden 
assumption that a community’s overlapping parts are 
less densely connected than its non-overlapping ones. 
The same as the AGM, both the GP-EPM and HGP- 
EPM do not make such a restrictive assumption, and 
they both allow overlaps of communities to be denser 
than communities themselves; Beyond the AGM, we 
do not restrict (pik to be either zero or one, and our 
generative models are built under a rigorous nonpara¬ 
metric Bayesian framework with efficient Bayesian in¬ 
ference, as presented below. 

3.4 MCMC Inference 

In this paper, we consider an unweighted undirected 
network, where hji = hij and self-links bu are not de¬ 


fined. Thus we only consider hij for j > i in (10). Let 
rriik.. be defined as in ^ and m.k^k 2 - 


m 


feife2- := 2 +mik^k^j), 


(1985) on exchangeability also apply to our EPMs. 


where ^/ci/c 2 = I if ki = k 2 and 6 k^k 2 = 0 otherwise. 
Using <§ and the Poisson additive property, we have 

rriik.. ~ Po{(t)iktOik), (14) 

/C2 ^k\k2 ), (15) 

where ek^k 2 '■= 2“'’'‘i''2 represents 

how strongly the nodes interact through communi¬ 

ties ki and /c 2 . Marginalizing out (^ik from (14) 
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and Xkik 2 from ([^, with p';, := uiikUci + coik) and 
Pkik 2 ■= 0kik2/{l3 + 0 kik 2 ), we have 

TOjfc.. ~ NB (aj,p'j,), (16) 

m.k,k2- NB • (17) 


Using the BerPo link, the gamma-Poisson conjugacy, 
and the augment-and-conquer techniques to infer the 
dispersion parameters ( |Zhon arid 
), we exploit ( p^ - 0 to derive 
sampling update equations for all 
model parameters except 70 , and construct an excel¬ 
lent proposal distribution to sample 70 using an in¬ 
dependence chain Metropolis-Hastings algorithm. We 
present in the Appendix the details of MCMC infer¬ 
ence for the HGP-EPM, and the hierarchical model 
and closed-form Gibbs sampling update equations for 
the GP-EPM. The inference of the nonparametric 
Bayesian AGM would be almost the same as that of 
the GP-EPM, with the only difference that the (0i/e|—) 
would be sampled from Bernoulli distributions. 


negative binomial 


Garin 2012 2015 


closed-form Gibbs 


4 EXPERIMENTAL RESULTS 


Eor comparison, we consider the infinite relational 
model (IRM) ofjKemp et al.|(|2006|), the Eigenmodel of 
Hoff|(| 2 ^, the infinite latent attribute (ILA) model 


of Palla et al. ( | 2012 ), the AGM of Yang and Leskovec 


( 2 Q 12 [ |2014| ), and our GP- and HGP-EPMs. We use 
the R package provided f or the Eigenmode l. We use 
the ILA cod^provided for Palla et al. ( 2012 ), in which 
it is shown that the ILA outperforms the related non¬ 
parametric latent feature relational model of [Miller 


et al. (2009). We implement a nonparametric Bayesian 


version of the AGM as a special case of the GP-EPM, 
as discussed in Section [T3l Matlab code for the EPMs 
is available on the author’s website. 


Eor the Eigenmodel, we find the best K in 
{5,10, 25, 50}. Eor the ILA, we use its default param¬ 
eter setting Eor the IRM, we choose Beta(0.1,1) as 
the prior for each latent block and Gam( 0 . 01 , 1 / 0 . 01 ) 
as the prior for the Ghinese restaurant process con¬ 
centration parameter; for the nonparametric Bayesian 
AGM, we let (pik ^ Ber( 7 ri), tt^ ~ Beta(0.01, 0.01) and 
e ^ Gam( 0 . 01 , 1 / 0 . 01 ); these parameters are found to 
consistently provide good performance. Eor our mod¬ 
els’ hyper-parameters, we choose eo = /o = 0.01 and 
let 7 o, Q, Co and (3 be all drawn from Gam(l, 1 ). 

We consider 3000 MGMG iterations and collect the last 
1500 samples, unless otherwise stated. We consider 

^ http://mlg.eng.cam.ac.uk/konstantina/ILA/ILA_code(vl).tar.gz 

^The default training/testing partition of the ILA code 
sends self-edges into the testing set; whereas in this paper, 
we do not intend to predict self-edges and hence we do not 
allow them to appear in the testing set. 


two small-scale benchmark networks, for which we test 
all algorithms and set the truncation level as i^max = 
100 for our algorithms, and another two networks with 
more than 2000 nodes, for which we set LCmax = 256. 


To test a model’s ability to predict missing edges of 
an unweighted undirected relational network, we ran- 
doml}!^ hold out 20 % pairs of nodes and use the the 
remaining 80% to predict the probability for an edge 
to exist in each of these held-out pairs. Letting oij = 0 
if hij is held out and Oij = 1 otherwise, we only need 
to slightly modify the inference by only considering 


{h 


= 1} in the likelihood. Eor example, ujik in 


^ would be redefined as cuik = Y^k' Yk'(t>jk'- 

e consider exactly the same five random training¬ 
testing partitions for all algorithms and report the 
average area under the curve (AUG) of both the re¬ 
ceiver operating characteristic (ROG) and precision- 
recall (PR) curves ( Davis and Goadrich[ 2006). Eor 
link prediction, the AUG-PR is more sensitive to the 
percentage of true edges among the top ranked ones. 
Note that in addition to link prediction, the HGP- 
EPM, GP-EPM, AGM and IRM all have easily in¬ 
terpretable latent representations that will be used to 
detect overlapping/disjoint communities. 


4.1 Protein230 Network 


We first consider the Protein230 dataset of IBntlandl 


et al. (2005) that describes the interactions between 


230 proteins, with 595 edges. This is a small-scale 
benchmark network that exhibits both homophily and 
stochastic equivalence, as shown in Hoff (2008) and 
also tested in Lloyd et al. ( 2012 ). We are able to 
run 3000 MGMG iterations quickly enough for all al¬ 
gorithms except for the ILA on this network. 


As shown in Tab. the HGP-EPM has the best over¬ 
all performance. The Eigenmodel is the second best 
with K = 10 and the IRM is the third best. The AGM 
is not competitive as it restricts its features to be bi¬ 
nary. In this and all future tables, we highlight in bold 
both the best result and the ones that are less than one 
standard error away from the best. Below we analyze 
why the HGP-EPM performs the best while the sim¬ 
pler GP-EPM is not that competitive on this dataset. 

As shown in Eigs. [^(b)-(d), the HGP-EPM captures 
both homophily and stochastic equivalence by accu¬ 
rately modeling both diagonal and off-diagonal dense 
regions of the adjacency matrix; the GP-EPM captures 
homophily by accurately modeling diagonal dense re¬ 
gions that represent intra-community interactions, but 
at the expense of creating nonexistent blocks in order 
to fit dense off-diagonal regions that represent strong 


^If removing an edge disconnects a node to all the oth¬ 
ers, then the edge will be kept in the training set. 
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(a) Protein interaction network 



50 too 150 200 

(c) GP-EPM 
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Figure 2: The inferred latent feature matrix and 

community-community interaction rate matrix {\k 1 k 2 } for 
the HGP-EPM on Protein230. The nodes are reordered to 
make a node with a larger index belong to the same or a 
smaller-size community, and the communities are ordered 
to make a community with a larger index to have a smaller 
size. The pixel values are displayed on the log-10 scale. 


4.2 NIPS234 Coauthor Network 


Figure 1: Comparison of three models on estimating the 
link probabilities for the Protein230 network using 80% of 
its node pairs. The nodes are reordered to make a node 
with a larger index belong to the same or a smaller-size 
community, where the disjoint community assignments are 
obtained by analyzing the results of the HGP-EPM. (a) 
The binary adjacency matrix, (b)-(c) Estimated link prob¬ 
abilities displayed on the log-10 scale from —2 to 1, with a 
brighter color representing a higher link probability. 

Table 1: Comparison of six algorithms on predicting miss¬ 
ing edges of the Protein230 network. The Eigenmodel 
achieves its best performance at K — 10. 


Model 

AUG-ROG 

AUG-PR 

IRM 

0.9338 ± 0.0128 

0.5026 ± 0.0676 

Eigenmodel 

0.9314 ± 0.0188 

0.5468 ± 0.0500 

ILA 

0.8971 ± 0.0297 

0.3693 ± 0.0234 

AGM 

0.9145 ± 0.0160 

0.3339 ± 0.0359 

GP-EPM 

0.9335 ± 0.0110 

0.4011 ± 0.0452 

HGP-EPM 

0.9519 ± 0.0100 

0.5655 ± 0.0505 


inter-community interactions; and the IRM captures 
these large dense blocks, but produces a cartoonish 
estimation, which overlooks small communities that 
represent fine details along the diagonal. 

Fig. shows how the HGP-EPM works. First, 
each feature vector 0^. shown in Fig. (a) clearly 
describes how strongly the nodes are affiliated with 
the community it represents, and each node may 
have large weights on multiple community. Sec¬ 
ond, about 30 latent feature vectors are inferred and 
the remaining ones are essentially drawn from the 
prior Gam(ai, 1 /q). Third, the inter- and intra¬ 
community interaction strengths in Fig. (b) can be 
matched to the corresponding communities (subsets of 
nodes) in Figs, [^(a) and (b). For example. Fig. |^(a) 
suggests that the first and second largest communities 
have 24 and 22 nodes, respectively, and Fig. i(b) 
suggests that the first and second communities have 
sparse and dense intra-community connections, respec¬ 
tively, and have denser connections between them, as 
confirmed by examining the block structures within 
the top-left 46 x 46 corner of both Figs, [^(a) and (b). 


We consider the small-scale NIPS234 network consists 
of the top 234 authors in NIPS 1-17 conference^ in 
terms of the number of publications, as studied in 


Miller et al. (2009). There are 598 edges. As shown 
in Tab. the GP-EPM and HGP-EPM have the best 
overall performance, followed by the IRM. Compar¬ 
ing with the simpler GP-EPM, the extra flexibility to 
model stochastic equivalence does not bring the HGP- 
EPM additional advantages on this dataset, which is 
not surprising as Fig. suggests that this coauthor 
network mainly exhibits homophily. Note that the 
IRM performs well measured by the AUC-ROC, but 
its AUC-PR is clearly worse than those of the EPMs. 
This may again be explained by its overly smoothed 
cartoonish estimation that overlooks small communi¬ 
ties, as clearly shown in Fig. [^(d). 


Table 2: Gomparison of six algorithms on predicting miss¬ 
ing edges of the NIPS234 coauthor network. The Eigen¬ 
model achieves its best performance at K — 10. 


Model 

AUG-ROG 

AUG-PR 

IRM 

0.9476 ± 0.0114 

0.6677 ± 0.0201 

Eigenmodel 

0.9269 ± 0.0177 

0.6784 ± 0.0364 

ILA 

0.9171 ± 0.0222 

0.6793 ± 0.0295 

AGM 

0.8906 ± 0.0164 

0.5842 ± 0.0357 

GP-EPM 

0.9501 ± 0.0123 

0.7415 ± 0.0319 

HGP-EPM 

0.9469 ± 0.0163 

0.7289 ± 0.0540 


4.3 Yeast and NIPS12 Networks 


We also consider the Yeasf[^ protein interaction net¬ 
work of Bu et al. ( 2QQ3| ), with 2361 nodes and 6646 
non-self edges, and the NIPS12 coauthor networl^ 
that includes all the 2037 authors in NIPS papers vols 
0-12, with 3134 edges. These two median-size net¬ 
works are already too large for the Eigenmodel and 
ILA to produce reasonable results given our computa¬ 
tional resources. The results in Tabs. [3] and [4] show 


^ http://chechiklab.biu.ac.il/~gal/data.html 

g 

http: / /vlado.fmf.uni-lj. si/pub/networks/data/bio/Yeast/Yeast.htm 

® http://www.cs.nyu.edu/~roweis/data.html 
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(a) NIPS234 coauthor network 


(b) HGP-EPM 



(c) GP-EPM 
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Figure 3: Comparison of three models on estimating the 
link probabilities for the NIPS234 coauthor network, (a)- 
(d) Analogous plots to Figures (a)-(d). 




Table 3: Comparison of four algorithms on predicting 
missing edges of the Yeast protein interaction network. 


Model 

AUC-ROC 

AUC-PR 

IRM 

0.9093 ± 0.0059 

0.1878 ± 0.0142 

AGM 

0.9009 ± 0.0025 

0.1225 ± 0.0129 

GP-EPM 

0.9331 ± 0.0014 

0.2486 ± 0.0149 

HGP-EPM 

0.9367 ± 0.0012 

0.2628 ± 0.0184 


Table 4: Comparison of four algorithms on predicting 


missing edges of the NIPS 12 coauthor network. 


Model 

AUC-ROC 

AUC-PR 

IRM 

0.9427 ± 0.0121 

0.2066 ± 0.0331 

AGM 

0.9328 ± 0.0049 

0.2350 ± 0.0177 

GP-EPM 

0.9768 ± 0.0079 

0.4705 ± 0.0362 

HGP-EPM 

0.9762 ± 0.0081 

0.4493 ± 0.0229 



Rank of a community Rank of a community 

Figure 4: A community’ size and its rank. 

model and ILA have at least 0{N‘^ + NK) compu¬ 
tation, where K is the number of latent features. 
With unoptimized Matlab on a 2.7 GHz CPU, for 
1000 MCMC iterations, the HGP-EPM (GP-EPM) 
takes about 80 (20) seconds on Protein230, about 85 
(28) seconds on NIPS234, about 50 (18) minutes on 
Yeast, and about 32 (12) minutes on NIPS12. The 
Eigenmodel with K = 2b takes about 200 seconds 
on NIPS234 to run 1000 MCMC iterations. Eor the 
ILA on NIPS234, we considered 1000 MCMC itera¬ 
tions that took over 18 hours to run; for Protein230, 
the ILA inferred about two times more features as it 
did on NIPS234, and we considered 500 MCMC itera¬ 
tions that took over 21 hours to run. 


5 CONCLUSIONS 


that the HGP-EPM performs the best on the Yeast 
protein-protein interaction network, which is found to 
clearly exhibit stochastic equivalence by examining the 
plots corresponding to the ones in Eigs. and (not 
shown for brevity), and the HGP-EPM and GP-EPM 
both perform well on the NIPS 12 coauthor network, 
which is found to mainly exhibit homophily by exam¬ 
ining related plots (not shown for brevity). 

As discussed before, the HGP-EPM, GP-EPM, AGM 
and IRM can all be used to assign nodes to disjoint 
communities. In Eig. |^we plot the size of an inferred 
latent community as a function of its rank (smaller 
ranks indicate larger sizes) on the log-10 scale, for the 
four scalable algorithms on the four tested real net¬ 
works. It is clear that in contrast to the other three 
latent factor models, the IRM, a latent class model, 
infers a smaller number of communities, with more 
larger-size and fewer smaller-size ones. Examining the 
details we find that the IRM tends to place all the 
low-degree nodes into one or several large-size com¬ 
munities, whereas the other models are able to better 
preserve fine details involving small-size communities. 

We mention that the HGP-EPM, GP-EPM and AGM 
have 0{Nd + NK) computation, whereas the Eigen- 


To model unweighted undirected relational networks 
characterized by both homophily and stochastic equiv¬ 
alence, we propose a hierarchical gamma process edge 
partition model (EPM) that supports an infinite num¬ 
ber of communities and an infinite square rate matrix 
to describe community-community interactions. The 
EPM exploits a Bernoulli-Poisson link to assign a la¬ 
tent count to each binary edge, and further partitions 
that count according to the edge’s affiliations with all 
pairs of communities, which naturally leads to the par¬ 
tition of each node into overlapping communities. We 
also provide a simpler gamma process EPM that omits 
inter-community interactions, which is found to per¬ 
form well on assortative networks. Efficient MCMC 
inference with closed-form update equations is pro¬ 
vided. Experimental results on four real networks il¬ 
lustrate the EPMs’ working mechanisms and proper¬ 
ties, as well as their state-of-the-art performance and 
interpretable latent representations. While previous 
latent feature relational models and their nonparamet- 
ric Bayesian versions are often not scalable, our infinite 
EPMs are readily scalable to networks with thousands 
of nodes. It would be interesting to investigate strate¬ 
gies to make them scalable to relational networks with 
millions of nodes and edges. 
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Infinite Edge Partition Models for Overlapping 
Community Detection and Link Prediction: 
Appendix 


A Proof for Lemma 1 

Using the law of total expectation, we have 


E 


. ki k 2 


4" 


fG(n) + [G(n)p-2;> 


Using Campbell’s theorem (Kingman 1993), we have 

Jn Jo 


E 


L k 


^ 2 ^-lg-cor 


drGo{d(jo) = 

^0 


The proof is completed by further using E[G(U)] = 
7 o/co and E[G^(U)] = yg/cg + To/cq- □ 

B MCMC Inference for HGP-EPM 


Sample rriij. As in Section 2 . 2 , we sample a latent 


count for each bij as 


/ K K \ 

{rriij \-) ^ bijFo^ EE 0i/ci ^/Ci/C2 0i/e2 I • 

\ki = l /C 2 = l / 

Sample mik^k 2 j- Using the relationships between the 
Poisson and multinomial distributions, similar to the 
derivation in Zhou et ah ( 2012 ), we partition the latent 
count mu as 


{{m 


ikik 2 j} 


}|-) - Multfmy; 


V ^ X]/ci S/C 2 / 

(19) 


Note that in each MCMC iteration we store mik.. and 
m.k^k 2 ' but not necessarily mik^k 2 j fbe memory. 
Sample ai. Using ( p!^ and the data a ugmentation 
technique developed in Zhou and Carin (2012 2015) 
for the negative binomial distribution, we sample as 


rriik.. 

t=i 


t-1 


(a,|-) -- Gam( cq + y^Jik, 


fo-Ek H^-P'ik) 


( 20 ) 


where with a slight abuse of notation, but for added 
conciseness, we use x ~ Ber[a/(a + t)] to repre¬ 
sent x = ^ Ber[a/(a + t)]. 

Sample (pik- Using ([T^ and the gamma-Poisson con- 
jugacy, we have 


{4^ik I ) ^ Gam ^ai T \ j{ci T 


Sample Vk- Similar to the inference of a^, using 0 - 
we sample Vk as 


kk 2 • 

(^fefeal-) E 

t=l 


rk^^>‘>‘^ (j.j.Jl-'Sfefca + t- I 


(rkl-) ~ Gam 


ki 

1 


Co - Efeo (^fe 2 ) ^ In (1 - Pfcfe J. 


( 22 ) 


Sample We resample the auxiliary variables l^k 
using the updated Vk and then sample ^ as 


(Cl“) ^ Gam 


eo 




/o - EfcCfcln(l -Pkk). ' 
(23) 

Sample Xkik 2 - Using (15) and the gamma-Poisson 
conjugacy, we have 


l/(/3 -h 6 >/ci/c2) • 
(24) 


Sample (3, Ci and cq. They can be sampled from 
gamma distributions using the conjugacy between 
gamma distributions, omitted here for brevity. 
Sample jq. As show in Lemma the mass parame¬ 
ter 7 o plays an important role in determining the total 
sum of the infinite rate matrix {Xkik 2 }' experi¬ 

ments show that it could be used as a tuning parameter 
to impose one’s prior preference on the number of ac¬ 
tive communities to be discovered. In this paper, we 
impose a gamma prior as 70 ^ Gam(l, 1 ) to let the 
data infer the posterior of 70 . We employ an indepen¬ 
dence chain Metropolis-Bastings algorithm to sample 
7 o, with the proposal distribution constructed as 


Qho) = Gam(l + E4, . 1 ^ , 

V V 1 - i? Efc ln(l - Pfe) 

where (4]-) -- CRT (Efe^ hk-^.'lo/K) and 


(25) 


- Efc, In (1 - Pkk,) 

Co - Efea In (1 - Pkk 2 ) ' 


We accept 7 o with probability min{l, 7 r}, where tt is 


nLiGam(rfc; 7 g/.R, l/co)Gam( 7 g; 1, l)Q(7o) 
nf=iGani(rfe; 7 o/R:, l/co)Gani( 7 o; 1, l)Q(7o) ’ 


which is usually greater than 50% for the networks 
considered in this paper. 


Each iteration of the MCMC for the HGP-EPM pro¬ 
ceeds from (Usl) to ([ 2 ^. 
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C Gamma Process EPM 


Marginalizing out (pik from (27), we have 


The gamma process EPM differs from the HGP-EPM 
in that it omits inter-community interactions, which 
leads to a simpler hierarchical model and faster com¬ 
putation at the expense of reduced ability to model 
stochastic equivalence. It is found to have good per¬ 
formance on assortative networks but not necessarily 
on disassortative ones. 


rriik.. ~NB {ai,p'ik), 


where 

/ _ 

■ Ci+Tk (l>jk ■ 

Marginalizing out Vk from ( [^ , we have 


(29) 


C.l Hierarchical Model 


m..k ^ NB ( 70 /i^, Pk ), (30) 


The (truncated) gamma process EPM is expressed as 

hij — l(777/2jf ^ 1 ), 

K 

'^ij ^ ^ '^ijk 1 ^ijk ^ Po (r/j^^/c^jf/c) , 

k=l 

4>ik ^ Gam(ai, I/q), ^ Gam(eo, I// 0 ), 

Vk -- Gam(7o/i^, I/cq), 70 ^ Gam(ei, I//1),. (26) 

where the Gam(l, 1 ) prior is also imposed on cq and 
Q. As ^ 00 , we recover the (exact) gamma process 
with a finite and continuous base measure. We usually 
set K to be large enough to ensure a good approxima¬ 
tion to the truly infinite model. 


Note that if we marginalize out both rriij and rriijk^ 
then we have 

K 

1 - n i~'^k(pik(pjk) ■ 

k=l 


bij ^ Bernoulli 


C.2 Gibbs Sampling 


Let the latent counts rrii.k and m..k be defined as 

N i -1 

'^i'k •— ^ ^ '^ijk T ^ ^ '^jik i 


j=i+l 
N N 


i=i 


N 


m..k := E E = \ ^'^i-k ■ 

i=l i=l 

Using the Poisson additive property, we have 
rrii.k ^ Po ^k(t>ik ^ (t>jk^ , 


m..k -- Po r/c 


Si 4^ik4^jk \ 

-^- ■ 


(27) 

(28) 


where 


Si Sjyi 4^ik<pjk 

2Co T Si 4^ik4^jk 


Similar to the inference techniques used in Appendix 
B, one may exploit (27)-(30) to derive closed-form 
Gibbs sampling update equations for all model param¬ 
eters, omitted here for brevity. 


D Gamma Process AGM 

Closely related to the gamma process EPM, the hi¬ 
erarchical model for the (truncated) gamma process 
AGM can be expressed as 

bij l(777/ijf ^ 1 ), 

K 

kriij — Uij -\r ^ ^ kTlijk 1 kflijk ^ Po {r'k4^ik4^jk^ i 
k=l 

Uij Po(e), e -- Gam(ao, l/^o), 

(j)ik ^ Ber( 7 ri), TTi ~ Beta(ai, 6 i), 

Vk ^ Gam( 7 o/Ar, I/cq), 70 ^ Gam(ei, I//1). (31) 


We sample r/e, 70 and cq in the same way we sam¬ 
ple them in the gamma process EPM. To sample (pik^ 
one may use (27) as the likelihood, under which 
is equal to one a.s. if rrii.k > 0 and is drawn from 
a Bernoulli distribution if rrii.k = 0- Gibbs sampling 
update equations for the other model parameters can 
be conviniently derived by exploiting conditional con- 
jugacies, omitted here for brevity. 













